—Author: Samantha Anderson—
—Last updated: 2019.06.17—
The purpose of this notebook is to analyze the Sort-Seq data of the spike-in library. This library is made of the 100 individual constructs that can be found in:
/data07/sanderson/methodsPaper/IndividualFlow/RAnalysis/100Individual/Summed\ Individual.Rmd.
The constructs were mixed, sorted, and sequenced as described below. The purpose of these experiments was to confirm that sort-seq data can reconstruct individual variant GFP levels from sequencing data.
Each of the 100 samples was picked from a separate LB/amp plate and grown overnight in 200 uL of LB/amp or M9 maltose media with amp in 96-well plates. In the morning, 10 uL of each sample were combined, diluted into PBS, and allowed to sit for at least 10 minutes at room temperature. I flowed the spike-in library and controls at <10,000 events per second with the settings below. Each day, the library was sorted into a different set of 7 GFP gates after being gated on cells and singlets. The raw Sony SH800 data can be found on the file server at:
smb://fs.biochem.wisc.edu/groups/SenesGrp/General/SMA/Cell\ Sorting/
2019.01.26
2019.01.28 (Maltose)
2019.03.15
2019.03.21
2019.03.22
2019.03.25
I set two gates in FlowJo to isolate which cells were appropriate to measure. The first gate was a circle on Forward vs. Side scattering area to isolate cells from debris. The second gate was Forward Area vs. Forward Height to exclude doublets from the sample. The GFP histogram data was analyzed with the 7 gates. The subsequent analyses were done without FlowJo. Since it was very difficult to set the gates in FlowJo exactly the same as the gates on the Sony, I decided to extract the summary statistics directly from the Sony software. This data can be found here:
smb://fs.biochem.wisc.edu/groups/SenesGrp/General/SMA/Cell\ Sorting/
IIndividualClonesMix.xlsx
Pools of sorted cells were grown until an OD600 of ~0.1 and miniprepped. The minipreps were PCRed and submitted to NGS as described in NGS protocol on the Senes lab dropbox. The submission files can be found at:
/data07/sanderson/methodsPaper/NGS\ Submissions/
20190208/AndersonSubmission_ManualMixture.xlsx
201904/Anderson_SpikeIn2.xlsx
I only look at the forward samples because they generally have better quality.
Raw .fastq Files
/data07/sanderson/methodsPaper/NGS Submissions/20190208
/data07/sanderson/methodsPaper/NGS Submissions/201904
Perl Program
/data07/sanderson/methodsPaper/NGS Submissions/ngsAnalysis_Limited.pl
For the maltose analysis, the cutoff is set at >2
The cutoff is set at >2 copies of a sequence for the reconstruction
The quality cutoff is applied per base, converting each FASTQ quality character to an error probability:
Q = (ASCII value of the quality character) - 33
P = 10^(-Q/10)
P <= 1
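As a sanity check, the conversion above can be sketched in R (the function name `phredScores` is mine, not from the analysis code):

```r
# Sketch of the quality conversion above: FASTQ quality characters to
# Phred scores (Q) and error probabilities (P). utf8ToInt() returns the
# ASCII code of each character in the string.
phredScores <- function(qualString) {
  ascii <- utf8ToInt(qualString)
  Q <- ascii - 33            # Phred score (Sanger / Illumina 1.8+ offset)
  P <- 10^(-Q / 10)          # per-base error probability
  data.frame(char = strsplit(qualString, "")[[1]], Q = Q, P = P)
}
phredScores("I#")            # 'I' gives Q = 40 (P = 1e-4); '#' gives Q = 2
```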
Reference Sequences
/data07/sanderson/methodsPaper/NGS Submissions/20190208/referenceSeqs.txt
/data07/sanderson/methodsPaper/NGS Submissions/201904/referenceSeqs.txt
Perl Output (01F.txt and 01R.txt)
/data07/sanderson/methodsPaper/NGS Submissions/20190208
/data07/sanderson/methodsPaper/NGS Submissions/201904
Output: ID Reads Percent
library(tidyverse)  #ggplot2 and general data manipulation
library(reshape2)   #melt() and dcast()
library(dplyr)
library(data.table)
library(mcr)        #mcreg() for Deming regression
library(gridExtra)  #grid.arrange()
library(grid)
library(plyr)       #revalue(); loaded after dplyr, so it masks some dplyr functions
Import data from Perl
LBdata <- read.csv("/data07/sanderson/methodsPaper/NGS_Submissions/20190208/derivedData/1F.txt", sep = "\t", skip = 3, header = F)
colnames(LBdata) = c("ID", "LBCount", "LBPercent")
LBlead = read.csv("/data07/sanderson/methodsPaper/NGS_Submissions/20190208/derivedData/1F.txt", skip = 1, header = FALSE, nrows = 1, sep = "\t", stringsAsFactors = FALSE)
LBtotalSeq = strtoi(substr(LBlead[1,1], 22, nchar(LBlead[1,1])))
LBlead = read.csv("/data07/sanderson/methodsPaper/NGS_Submissions/20190208/derivedData/1F.txt", skip = 2, header = FALSE, nrows = 1, sep = "\t", stringsAsFactors = FALSE)
poorSeq = strtoi(substr(LBlead[1,1], 17, nchar(LBlead[1,1])))
noStart = strtoi(substr(LBlead[1,2], 10, nchar(LBlead[1,2])))
noEnd = strtoi(substr(LBlead[1,3], 9, nchar(LBlead[1,3])))
LBtotalSeq = LBtotalSeq - poorSeq - noStart - noEnd
maltData <- read.csv("/data07/sanderson/methodsPaper/NGS_Submissions/20190208/derivedData/9F.txt", sep = "\t", skip = 3, header = F, stringsAsFactors = FALSE)
colnames(maltData) = c("ID", "maltCount", "maltPercent")
maltlead = read.csv("/data07/sanderson/methodsPaper/NGS_Submissions/20190208/derivedData/9F.txt", skip = 1, header = FALSE, nrows = 1, sep = "\t", stringsAsFactors = FALSE)
maltTotalSeq = strtoi(substr(maltlead[1,1], 22, nchar(maltlead[1,1])))
maltlead = read.csv("/data07/sanderson/methodsPaper/NGS_Submissions/20190208/derivedData/9F.txt", skip = 2, header = FALSE, nrows = 1, sep = "\t", stringsAsFactors = FALSE)
poorSeq = strtoi(substr(maltlead[1,1], 17, nchar(maltlead[1,1])))
noStart = strtoi(substr(maltlead[1,2], 10, nchar(maltlead[1,2])))
noEnd = strtoi(substr(maltlead[1,3], 9, nchar(maltlead[1,3])))
maltTotalSeq = maltTotalSeq - poorSeq - noStart - noEnd
summedData = merge(LBdata, maltData, by.x = "ID", all.x = TRUE, all.y = TRUE)
summedData[is.na(summedData)] <- 0
allData <- melt(summedData[,c('ID', 'LBPercent', 'maltPercent')], id.vars = 1)
p <- ggplot(data = allData, aes(x = ID, y = value))
p + geom_bar(aes(fill = variable),stat = "identity", position = position_dodge(0.9)) + labs (x = "Sample ID", y = "Percentage of Sequence Reads") + ggtitle("Maltose vs. LB Library Composition") + theme(axis.text.x = element_text(angle = 90, hjust = 1), text = element_text(size = 30))
Negative values mean that there was a greater percentage of the construct in maltose than in LB, meaning it increased and is “good maltose.” Positive values mean that there was a greater percentage of the construct in LB than in maltose, meaning it decreased and is “bad maltose.”
summedData$Change = (summedData$LBPercent - summedData$maltPercent)/ summedData$LBPercent
for (i in 1:nrow(summedData)){
if (summedData$ID[i] == "1E01" || summedData$ID[i] == "1E11" || summedData$ID[i] == "2E11" || summedData$ID[i] == "2F12" || summedData$ID[i] == "2H11"){
summedData$label[i] <- "BadMalt"
} else if (summedData$ID[i] == "GpA" || summedData$ID[i] == "G83I"){
summedData$label[i] <- "Ctrl"
} else {
summedData$label[i] <- "Test"
}
}
p2 <- ggplot(data = summedData, aes(x = reorder(factor(ID), Change), y = Change, fill = label))
p2 + geom_bar(stat = "identity", position = position_dodge(0.9)) + labs (x = "Sample ID", y = "Fractional Change in Read Percentage") +
ggtitle("Change between LB and Maltose") + theme(axis.text.x = element_text(angle = 90, hjust = 1),
text = element_text(size = 30, family = "mono"))
controlData <- read.csv("/run/user/1008/gvfs/smb-share:server=fs.biochem.wisc.edu,share=groups/SenesGrp/General/SMA/Cell Sorting/Experiment 01262019 100210 AM/GateB/20190126_GpA_B.csv", sep = ",", stringsAsFactors = FALSE, header = TRUE, colClasses = c(rep("NULL", 7), "double", "NULL", "NULL"))
setwd("/run/user/1008/gvfs/smb-share:server=fs.biochem.wisc.edu,share=groups/SenesGrp/General/SMA/Cell Sorting/Experiment 01262019 100210 AM/GateB")
filenames <- list.files()
gateSummaries <- data.frame(Date = character(),
Sample = character(),
Gate = character(),
Min = double(),
Max = double(),
Mean = double(),
Median = double(),
stringsAsFactors = FALSE)
data.list <- lapply(filenames, function(i){
tmp <- read.csv(i, sep = ",", stringsAsFactors = FALSE, header = TRUE,
colClasses = c(rep("NULL", 7), "double", "NULL", "NULL"))
names(tmp) <- i
tmp #Return the data itself, not just the filename
})
for (i in seq_along(filenames)){
title <- strsplit(filenames[i], "_")[[1]]
Date <- title[1]
Sample <- title[2]
Gate <- strsplit(title[3], ".", fixed = TRUE)[[1]][1]
vals <- data.list[[i]][[1]]
gateSummaries[nrow(gateSummaries) + 1, ] <- list(Date, Sample, Gate,
min(vals), max(vals), mean(vals), median(vals))
}
Import data from Perl output
sortInfo <- read.csv("/data07/sanderson/methodsPaper/sortSeq/SpikeIn/derivedData/GateInfo2.csv", sep = "\t", stringsAsFactors = FALSE, header = TRUE)
sortInfo$Reads <- NA
allGoodSeqs = c()
allSeqs <- data.frame(ID = character(), Reads = integer(), Percent = double(), Date = character(), Bin = character())
for (i in c(2:8)){ #Rep 1
filename = paste("/data07/sanderson/methodsPaper/NGS_Submissions/20190208/derivedData/", as.character(i), "F.txt", sep = "")
#Get read counts
top = read.csv(filename, skip = 1, header = FALSE, nrows = 1, sep = "\t", stringsAsFactors = FALSE)
totalSeq = strtoi(substr(top[1,1], 22, nchar(top[1,1])))
top = read.csv(filename, skip = 2, header = FALSE, nrows = 1, sep = "\t", stringsAsFactors = FALSE)
poorSeq = strtoi(substr(top[1,1], 17, nchar(top[1,1])))
noStart = strtoi(substr(top[1,2], 10, nchar(top[1,2])))
noEnd = strtoi(substr(top[1,3], 9, nchar(top[1,3])))
goodSeqs = totalSeq - poorSeq - noStart - noEnd
gate = as.character(i - 2)
sortInfo[i-1, 12] <- goodSeqs
#Get data
data <- read.csv(filename, sep = "\t", skip = 3, header = F)
sample = paste("C", gate, sep = "")
colnames(data) = c("ID", "Reads", "Percent")
data$Date <- c("2019.01.26")
data$Bin <- c(sample)
allSeqs <- rbind(allSeqs, data)
}
for (i in c(1:31)){
if (i <10){
filename = paste("/data07/sanderson/methodsPaper/NGS_Submissions/201904/derivedData/0", as.character(i), "F.txt", sep = "")
} else {
filename = paste("/data07/sanderson/methodsPaper/NGS_Submissions/201904/derivedData/", as.character(i), "F.txt", sep = "")
}
#Get read counts
top = read.csv(filename, skip = 1, header = FALSE, nrows = 1, sep = "\t", stringsAsFactors = FALSE)
totalSeq = strtoi(substr(top[1,1], 22, nchar(top[1,1])))
top = read.csv(filename, skip = 2, header = FALSE, nrows = 1, sep = "\t", stringsAsFactors = FALSE)
poorSeq = strtoi(substr(top[1,1], 17, nchar(top[1,1])))
noStart = strtoi(substr(top[1,2], 10, nchar(top[1,2])))
noEnd = strtoi(substr(top[1,3], 9, nchar(top[1,3])))
goodSeqs = totalSeq - poorSeq - noStart - noEnd
sortInfo[i+7, 12] <- goodSeqs
#Get data
data <- read.csv(filename, sep = "\t", skip = 3, header = F)
colnames(data) = c("ID", "Reads", "Percent")
if (i < 8){
data$Date <- c("2019.03.15")
gate = as.character(i - 1)
} else if (7 < i && i < 16) {
data$Date <- c("2019.03.21")
gate = as.character(i - 8)
} else if (15 < i && i < 24) {
data$Date <- c("2019.03.22")
gate = as.character(i - 16)
} else if (23 < i) {
data$Date <- c("2019.03.25")
gate = as.character(i - 24)
}
sample = paste("C", gate, sep = "")
data$Bin <- c(sample)
allSeqs <- rbind(allSeqs, data)
}
Table containing the GFP and NGS data for the gates by date
sortInfo
Table containing the number of reads for each construct by day and by bin.
2 cutoff: after this table, I removed 1D10 from 2019.03.15 because it only shows up with 4 counts in 1 bin, significantly skewing the calculations. 100 cutoff: after this table, I removed 2F11 entirely because it only shows up in 1 bin on 2 different dates.
allSeqs
allSeqs <- allSeqs[-c(1825),]
allSeqs <- allSeqs[-c(1240),]
allSeqs <- allSeqs[-c(526),]
#allSeqs <- allSeqs[-c(3244),]
These reads are after sorting out poor quality sequences and sequences that do not have both accurate primers.
dodge <- position_dodge(width = 0.9)
p <- ggplot(data = sortInfo, aes(x = factor(Bin), y = Reads, fill = factor(Date)))
p + geom_bar(stat = "identity", position = position_dodge(0.9)) +
labs (x = "Gate", y = "Total Reads") +
ggtitle("NGS Read Distribution by Gate and Date") + scale_fill_discrete(name = "Date") +
theme(plot.title = element_text(hjust = 0.5))
sortInfo$ReflowMean <- as.numeric(as.vector(sortInfo$ReflowMean))
sortInfo$SetMean <- as.numeric(as.vector(sortInfo$SetMean))
p <- ggplot(data = sortInfo, aes(x=ReflowMean, y=Fraction, color = Date, group = Date)) + geom_point() +scale_x_log10() +
geom_line() + labs (x = "Mean Gate GFP Fluorescence (RFU)", y = "Fraction of Population") +
ggtitle("GFP Mean of Reflow") + theme(plot.title = element_text(hjust = 0.5)) +
theme(text = element_text(size = 25))
p2 <- ggplot(data = sortInfo, aes(x=SetMean, y=Fraction, color = Date, group = Date)) + geom_point() +scale_x_log10() +
geom_line() + labs (x = "Mean Gate GFP Fluorescence (RFU)", y = "Fraction of Population") +
ggtitle("GFP Mean of Set Bin") + theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 25),
legend.position="none")
grid.arrange(p, p2, ncol = 2)
sortInfo$ReflowMedian <- as.numeric(as.vector(sortInfo$ReflowMedian))
sortInfo$SetMedian <- as.numeric(as.vector(sortInfo$SetMedian))
p <- ggplot(data = sortInfo, aes(x=ReflowMedian, y=Fraction, color = Date, group = Date)) + geom_point() +scale_x_log10() +
geom_line() + labs (x = "Median Gate GFP Fluorescence (RFU)", y = "Fraction of Population") +
ggtitle("GFP Median of Reflow") + theme(plot.title = element_text(hjust = 0.5),
text = element_text(size = 25), legend.position="none")
p1 <- ggplot(data = sortInfo, aes(x=SetMedian, y=Fraction, color = Date, group = Date)) + geom_point() +scale_x_log10() +
geom_line() + labs (x = "Median Gate GFP Fluorescence (RFU)", y = "Fraction of Population") +
ggtitle("GFP Median of Set Bin") + theme(plot.title = element_text(hjust = 0.5),
text = element_text(size = 25), legend.position="none")
grid.arrange(p, p1, ncol = 2)
For this method, the GFP level is calculated as a weighted average¹. This method normalizes the reads per construct per bin by the fraction of the population that is in that bin.
\[a_{ij} = \left. \frac{f_j \cdot c_{ij}}{\sum_i c_{ij}} \middle/ \sum_j \frac{f_j \cdot c_{ij}}{\sum_i c_{ij}} \right.\]
\(j\) = bin
\(i\) = construct
\(f_j\) = fraction of the population that sorts to bin \(j\)
\(c_{ij}\) = number of reads of construct \(i\) in bin \(j\)
\(a_{ij}\) = fraction of weighted construct \(i\) that can be found in bin \(j\)
\[\sum_j a_{ij} = 1 \] \[p_i = \sum_j a_{ij} \cdot m_j\]
\(m_j\) = median GFP in bin \(j\)
\(p_i\) = reconstructed GFP value of construct \(i\)
1: Kosuri, S., Goodman, D.B., Cambray, G., Mutalik, V.K., Gao, Y., Arkin, A.P., Endy, D., and Church, G.M. (2013). Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl. Acad. Sci. U.S.A. 110, 14024–14029.
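A minimal numeric sketch of the weighted average above, using made-up counts and bin statistics (none of these numbers come from the experiment):

```r
# Toy example of the Kosuri weighted average for a single construct i.
counts    <- c(10, 50, 40)        # c_ij: reads of construct i in bins 1..3
binTotals <- c(1000, 2000, 1600)  # sum_i c_ij: total reads per bin
f         <- c(0.2, 0.5, 0.3)     # f_j: fraction of the population per bin
m         <- c(1e3, 1e4, 1e5)     # m_j: median GFP per bin

num <- f * counts / binTotals     # f_j * c_ij / sum_i c_ij
a   <- num / sum(num)             # a_ij, normalized so sum_j a_ij = 1
p   <- sum(a * m)                 # p_i: reconstructed GFP for construct i
```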
keeps <- c("ID", "Date")
denominator <- allSeqs[keeps]
denominator <- unique(denominator)
denominator$Value <- NA
allSeqs$Num <- NA
allSeqs$a <- NA
for (row in 1:nrow(allSeqs)){
date <- allSeqs[row, 4]
bin <- allSeqs[row, 5]
tmp <- sortInfo[which(sortInfo[,1] == date),] #Match Date
binReads <-tmp[which(tmp[,2]==bin), 12] #Get total bin reads
binFraction <- tmp[which(tmp[,2]==bin), 5] #Get bin fraction
numerator = binFraction * allSeqs[row,2] / binReads #Calculate the numerator
allSeqs[row, 6] = numerator #Put numerator in allSeqs
ID <- allSeqs[row, 1]
#Add numerator to the denomintor total
for(dRow in 1:nrow(denominator)){
if(denominator[dRow, 2] == date){
if(denominator[dRow, 1] == ID){
if (is.na(denominator[dRow, 3])) {
denominator[dRow, 3] <- numerator
break
} else {
denominator[dRow, 3] <- denominator[dRow, 3] + numerator
break
}
}
}
}
}
allSeqs$A_reMed <- NA
allSeqs$A_reMea <- NA
allSeqs$A_seMed <- NA
allSeqs$A_seMea <- NA
for (row in 1:nrow(allSeqs)){
date <- allSeqs[row,4]
#Calculate A
tmp <- denominator[which(denominator[,2] == date),] #Match Date
D <-tmp[which(tmp[,1]==allSeqs[row,1]),3] #Match ID and get denominator
A = allSeqs[row,6] / D #Calculate A
allSeqs[row, 7] = A #Put a in allSeqs
#Calculate A_mj
tmp <- sortInfo[which(sortInfo[,1] == date),] #Match Date
reMed <-tmp[which(tmp[,2]==allSeqs[row,5]),3] #Match bin and get reflow median GFP of the bin
reMea <-tmp[which(tmp[,2]==allSeqs[row,5]),4] #Match bin and get reflow mean GFP of the bin
seMed <-tmp[which(tmp[,2]==allSeqs[row,5]),8] #Match bin and get set median GFP of the bin
seMea <-tmp[which(tmp[,2]==allSeqs[row,5]),9] #Match bin and get set mean GFP of the bin
allSeqs[row, 8] <- A * reMed
allSeqs[row, 9] <- A * reMea
allSeqs[row, 10] <- A * seMed
allSeqs[row, 11] <- A * seMea
}
denominator$p_reMed <- NA
denominator$p_reMea <- NA
denominator$p_seMed <- NA
denominator$p_seMea <- NA
for (row in 1:nrow(allSeqs)){
date <- allSeqs[row, 4]
ID <- allSeqs[row, 1]
A_reMed <- allSeqs[row, 8]
A_reMea <- allSeqs[row, 9]
A_seMed <- allSeqs[row, 10]
A_seMea <- allSeqs[row, 11]
#Add A_mj to the P total
for(dRow in 1:nrow(denominator)){
if(denominator[dRow, 2] == date){ #Match date
if(denominator[dRow, 1] == ID){ #Match ID
if (is.na(denominator[dRow, 4])) {
denominator[dRow, 4] <- A_reMed
denominator[dRow, 5] <- A_reMea
denominator[dRow, 6] <- A_seMed
denominator[dRow, 7] <- A_seMea
break
} else {
denominator[dRow, 4] <- denominator[dRow, 4] + A_reMed
denominator[dRow, 5] <- denominator[dRow, 5] + A_reMea
denominator[dRow, 6] <- denominator[dRow, 6] + A_seMed
denominator[dRow, 7] <- denominator[dRow, 7] + A_seMea
break
}
}
}
}
}
Summary of reconstructed GFP values (p) of constructs by date
data_wide <- dcast(denominator, ID ~ Date, value.var = "p_reMed")
data_wide
p <- ggplot(data = denominator, aes(x = factor(ID), y = p_reMed, fill = factor(Date)))
p + geom_bar(stat = "identity", position = position_dodge(0.9)) +
labs (x = "Construct", y = "Reconstructed GFP (RFU)") +
ggtitle("Kosuri Weighted Average") + scale_fill_discrete(name = "Date") +
theme(axis.text.x = element_text(angle = 90, hjust = 1), text = element_text(size = 25),
plot.title = element_text(hjust = 0.5))
The reconstructed values from 2019.01.26 are noticeably higher than the rest. I will compare using the day-by-day baseline averages.
Average reconstructed GFP values (p)
stats <- c("ID", "p_reMed", "p_reMea", "p_seMed", "p_seMea")
summedData <- aggregate(.~ID, denominator[stats], FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x)))
summedData <- do.call(data.frame, summedData)
summedData$p_reMed.se <- summedData$p_reMed.sd / sqrt(summedData$p_reMed.n)
summedData$p_reMea.se <- summedData$p_reMea.sd / sqrt(summedData$p_reMea.n)
summedData$p_seMed.se <- summedData$p_seMed.sd / sqrt(summedData$p_seMed.n)
summedData$p_seMea.se <- summedData$p_seMea.sd / sqrt(summedData$p_seMea.n)
summedData$label <- NA
for (i in 1:nrow(summedData)){
if (summedData$ID[i] == "GpA" || summedData$ID[i] == "G83I"){
summedData$label[i] <- "Ctrl"
} else {
summedData$label[i] <- "Test"
}
}
summedData
dodge <- position_dodge(width = 0.9)
limits <- aes(ymax = summedData$p_seMea.mean + summedData$p_seMea.se,
ymin = summedData$p_seMea.mean - summedData$p_seMea.se)
p <- ggplot(data = summedData, aes(x = reorder(factor(ID), p_seMea.mean), y = p_seMea.mean, fill = label))
p + geom_bar(stat = "identity", position = position_dodge(0.9)) +
geom_errorbar(limits, position = position_dodge(0.9), width = 0.25) +
labs (x = "Construct", y = "Average Reconstructed GFP (RFU)") +
ggtitle("Kosuri Weighted Average") + theme(axis.text.x = element_text(angle = 90, hjust = 1), text = element_text(size = 25), plot.title = element_text(hjust = 0.5))
*Error bars are the standard error of mostly 5 replicates. 2 samples (1D10 and 1B06) have 4 replicates, and 1 sample (2G03) showed up in only 1 replicate.
individualData <- read.csv("/data07/sanderson/methodsPaper/IndividualFlow/RAnalysis/100Individual/summaryStatistics.csv", header = TRUE)
individualData$Construct <- revalue(individualData$Construct, c("1A02" = "GpA", "1A03" = "G83I"))
compare <- merge(summedData, individualData, by.x = "ID", by.y = "Construct")
#Calculate best fit line
reg <- lm(p_reMed.mean ~ Median.mean, data = compare)
coeff=coefficients(reg)
eq_reMed = paste0("y = ", round(coeff[2],3), "*x + ", round(coeff[1],0), ", r^2 = ", round(summary(reg)$r.squared, 3))
reg <- lm(p_reMea.mean ~ Median.mean, data = compare)
coeff=coefficients(reg)
eq_reMea = paste0("y = ", round(coeff[2],3), "*x + ", round(coeff[1],0), ", r^2 = ", round(summary(reg)$r.squared, 3))
reg <- lm(p_seMed.mean ~ Median.mean, data = compare)
coeff=coefficients(reg)
eq_seMed = paste0("y = ", round(coeff[2],3), "*x + ", round(coeff[1],0), ", r^2 = ", round(summary(reg)$r.squared, 3))
reg <- lm(p_seMea.mean ~ Median.mean, data = compare)
coeff=coefficients(reg)
eq_seMea = paste0("y = ", round(coeff[2],3), "*x + ", round(coeff[1],0), ", r^2 = ", round(summary(reg)$r.squared, 3))
#Plots
xLimits <- aes(xmax = compare$Median.mean + compare$Median.se, xmin = compare$Median.mean - compare$Median.se)
yLimits <- aes(ymax = compare$p_reMed.mean + compare$p_reMed.se, ymin = compare$p_reMed.mean - compare$p_reMed.se)
p <- ggplot(data = compare, aes(x=Median.mean, y=p_reMed.mean)) + geom_point() + scale_x_log10() +
scale_y_log10(limits = c(30000,150000)) + geom_smooth(method = lm) + labs (x = "Measured Median GFP Fluorescence (RFU)",
y = "Reconstructed Median GFP Fluorescence (RFU)", title = "Measured vs. Kosuri Weighted Average (Reflow Median)",
subtitle = eq_reMed) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
text = element_text(size = 15), axis.title.x=element_blank())
yLimits <- aes(ymax = compare$p_reMea.mean + compare$p_reMea.se, ymin = compare$p_reMea.mean - compare$p_reMea.se)
p1 <- ggplot(data = compare, aes(x=Median.mean, y=p_reMea.mean)) + geom_point() + scale_x_log10() +
scale_y_log10(limits = c(30000,150000)) + geom_smooth(method = lm) +
labs (title = "Measured vs. Kosuri Weighted Average (Reflow Mean)", subtitle = eq_reMea) +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), text = element_text(size = 15),
axis.title.x=element_blank(), axis.title.y=element_blank())
yLimits <- aes(ymax = compare$p_seMed.mean + compare$p_seMed.se, ymin = compare$p_seMed.mean - compare$p_seMed.se)
p2 <- ggplot(data = compare, aes(x=Median.mean, y=p_seMed.mean)) + geom_point() + scale_x_log10() +
scale_y_log10(limits = c(30000,150000)) + geom_smooth(method = lm) + labs (x = "Measured Median GFP Fluorescence (RFU)",
y = "Reconstructed Median GFP Fluorescence (RFU)", title = "Measured vs. Kosuri Weighted Average (Set Median)",
subtitle = eq_seMed) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
text = element_text(size = 15))
yLimits <- aes(ymax = compare$p_seMea.mean + compare$p_seMea.se, ymin = compare$p_seMea.mean - compare$p_seMea.se)
p3 <- ggplot(data = compare, aes(x=Median.mean, y=p_seMea.mean)) + geom_point() + scale_x_log10() +
scale_y_log10(limits = c(30000,150000)) + geom_smooth(method = lm) + labs (x = "Measured Median GFP Fluorescence (RFU)",
title = "Measured vs. Kosuri Weighted Average (Set Mean)",
subtitle = eq_seMea) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
text = element_text(size = 15), axis.title.y=element_blank())
grid.arrange(p, p1, p2, p3, ncol = 2)
p3 + geom_errorbar(yLimits, width = 0) +
geom_errorbarh(xLimits, width = 0)
## Warning: Ignoring unknown parameters: width
## Warning: Removed 1 rows containing missing values (geom_errorbar).
Deming regression allows a total least squares fit that takes into account the x and y error of each point. The Pearson correlation is indicated on the graph as rho.
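The slope and intercept in the blocks below come from the first principal component, which is the orthogonal (total least squares) fit, i.e. Deming regression with an error-variance ratio of 1. A quick synthetic check of that equivalence (the data here are simulated, not the experimental values):

```r
# Synthetic check: the first-PC slope recovers the generating slope.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.1)
pca <- prcomp(cbind(x, y))
slp <- pca$rotation[2, 1] / pca$rotation[1, 1]  # slope of the first PC
int <- mean(y) - slp * mean(x)                  # line through the centroid
```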
#Reflow Median
dem.reg <- mcreg(compare$Median.mean, compare$p_reMed.mean, method.reg = "Deming")
pca <- prcomp(~compare$Median.mean+compare$p_reMed.mean, compare)
slp <- with(pca, rotation[2,1] / rotation[1,1])
int <- with(pca, center[2] - slp*center[1])
keeps <- c("p_reMed.mean", "Median.mean") #Correlate the variable actually plotted
corr <- compare[keeps]
mat <- cor(corr, use = "all.obs", method = "pearson")
pear <- mat[1,2]
eq = paste0("y = ", round(slp, 3), "*x + ", round(int,0), ", rho = ", round(pear, 3))
p <- ggplot(data = compare, aes(x=Median.mean, y=p_reMed.mean)) + geom_point() +
labs (x = "Measured Median GFP Fluorescence (RFU)",
y = "Reconstructed Median GFP Fluorescence (RFU)", title = "Measured vs. Kosuri Weighted Average (Reflow Median): Deming Regression",
subtitle = eq) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
geom_abline(slope = slp, intercept = int)
#Reflow Mean
dem.reg <- mcreg(compare$Median.mean, compare$p_reMea.mean, method.reg = "Deming")
pca <- prcomp(~compare$Median.mean+compare$p_reMea.mean, compare)
slp <- with(pca, rotation[2,1] / rotation[1,1])
int <- with(pca, center[2] - slp*center[1])
keeps <- c("p_reMea.mean", "Median.mean") #Correlate the variable actually plotted
corr <- compare[keeps]
mat <- cor(corr, use = "all.obs", method = "pearson")
pear <- mat[1,2]
eq = paste0("y = ", round(slp, 3), "*x + ", round(int,0), ", rho = ", round(pear, 3))
p1 <- ggplot(data = compare, aes(x=Median.mean, y=p_reMea.mean)) + geom_point() +
labs (x = "Measured Median GFP Fluorescence (RFU)",
y = "Reconstructed Median GFP Fluorescence (RFU)", title = "Measured vs. Kosuri Weighted Average (Reflow Mean): Deming Regression",
subtitle = eq) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
geom_abline(slope = slp, intercept = int)
#Set Median
dem.reg <- mcreg(compare$Median.mean, compare$p_seMed.mean, method.reg = "Deming")
pca <- prcomp(~compare$Median.mean+compare$p_seMed.mean, compare)
slp <- with(pca, rotation[2,1] / rotation[1,1])
int <- with(pca, center[2] - slp*center[1])
keeps <- c("p_seMed.mean", "Median.mean") #Correlate the variable actually plotted
corr <- compare[keeps]
mat <- cor(corr, use = "all.obs", method = "pearson")
pear <- mat[1,2]
eq = paste0("y = ", round(slp, 3), "*x + ", round(int,0), ", rho = ", round(pear, 3))
p2 <- ggplot(data = compare, aes(x=Median.mean, y=p_seMed.mean)) + geom_point() +
labs (x = "Measured Median GFP Fluorescence (RFU)",
y = "Reconstructed Median GFP Fluorescence (RFU)", title = "Measured vs. Kosuri Weighted Average (Set Median): Deming Regression",
subtitle = eq) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
geom_abline(slope = slp, intercept = int)
#Set Mean
dem.reg <- mcreg(compare$Median.mean, compare$p_seMea.mean, method.reg = "Deming")
pca <- prcomp(~compare$Median.mean+compare$p_seMea.mean, compare)
slp <- with(pca, rotation[2,1] / rotation[1,1])
int <- with(pca, center[2] - slp*center[1])
keeps <- c("p_seMea.mean", "Median.mean")
corr <- compare[keeps]
mat <- cor(corr, use = "all.obs", method = "pearson")
pear <- mat[1,2]
eq = paste0("y = ", round(slp, 3), "*x + ", round(int,0), ", rho = ", round(pear, 3))
p3 <- ggplot(data = compare, aes(x=Median.mean, y=p_seMea.mean)) + geom_point() +
labs (x = "Measured Median GFP Fluorescence (RFU)",
y = "Reconstructed Median GFP Fluorescence (RFU)", title = "Measured vs. Kosuri Weighted Average (Set Mean): Deming Regression",
subtitle = eq) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
geom_abline(slope = slp, intercept = int)
#Plot
grid.arrange(p, p1, p2, p3, ncol = 2)
yLimits <- aes(ymax = compare$p_seMed.mean + compare$p_seMed.se, ymin = compare$p_seMed.mean - compare$p_seMed.se)
p2 + geom_errorbar(yLimits, width = 0) +
geom_errorbarh(xLimits, width = 0)
## Warning: Ignoring unknown parameters: width
For this method, the GFP level is calculated similarly to Kosuri et al.¹ However, I changed the fluorescence value from the median of a reflowed bin to one based on the bin's lower and upper limits, as used in Peterman and Levine².
\[ L_{j+1} = U_j \] \[\log U_j = \log L_j + w \]
\(j\) = bin
\(L_j\) = lower limit of bin \(j\)
\(U_j\) = upper limit of bin \(j\)
\(w\) = width of the bin (in log space)
\[\phi_j = b\sqrt{L_j U_j} \]
\(\phi_j\) = fluorescence value in bin \(j\)
\(b\) = multiplying factor
\[ b = w/(e^{w/2} - e^{-w/2}) \]
\[ \hat v_i = \sum_{j = 1}^{m} a_{ij} \phi_j\]
\(i\) = construct
\(\hat v_i\) = reconstructed GFP value for construct \(i\)
\(m\) = total number of bins
\(a_{ij}\) = weighted read average of construct \(i\) in bin \(j\)
2: Peterman, N., and Levine, E. (2016). Sort-seq under the hood: implications of design choices on large-scale characterization of sequence-function relations. BMC Genomics 17, 206.
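A small sketch of the bin value calculation above, with hypothetical gate limits (log-spaced bins of width w in natural-log units; these numbers are not from the experiment):

```r
# Sketch of the Peterman-Levine representative bin fluorescence:
# phi_j = b * sqrt(L_j * U_j), with b = w / (exp(w/2) - exp(-w/2)).
L <- c(1e3, 1e4)                      # L_j: lower bin limits (U_j = L_{j+1})
U <- c(1e4, 1e5)                      # U_j: upper bin limits
w <- log(U[1]) - log(L[1])            # bin width in log space
b <- w / (exp(w / 2) - exp(-w / 2))   # multiplying factor
phi <- b * sqrt(L * U)                # phi_j for each bin
```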
We use \(r_j\) to denote the count of cells falling in bin \(j\), and \(\Phi(x;\mu,\sigma)\) to denote the c.d.f. of a normal distribution with mean \(\mu\) and variance \(\sigma^2\). The log-likelihood of the flow histogram is then:
\[{\rm log}(L(\mu, \sigma|r)) = \sum_{j = 1}^m r_j \times {\rm log}(\Phi(U_j;\mu,\sigma) - \Phi(L_j;\mu,\sigma)).\]
The MLE of \(\mu\) and \(\sigma\) is: \[(\mu, \sigma) = {\rm argmax}_{\mu,\sigma} {\rm log}(L(\mu, \sigma|r))\]
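The maximization above can be sketched with base R's `optim()` on made-up bin counts (the bin limits and counts here are hypothetical, not the experimental data):

```r
# Sketch of the binned-normal MLE: maximize the log-likelihood of counts r_j
# over bins (L_j, U_j] under a Normal(mu, sigma) model.
binNegLogLik <- function(par, r, L, U) {
  mu <- par[1]; sigma <- par[2]
  if (sigma <= 0) return(Inf)
  pj <- pnorm(U, mu, sigma) - pnorm(L, mu, sigma)  # P(bin j) under the model
  if (any(pj <= 0)) return(Inf)
  -sum(r * log(pj))                                # negative log-likelihood
}
L <- c(-Inf, 0, 1, 2)          # lower bin limits (e.g. log-GFP units)
U <- c(0, 1, 2, Inf)           # upper bin limits
r <- c(5, 30, 45, 20)          # cell counts per bin
fit <- optim(c(1, 1), binNegLogLik, r = r, L = L, U = U)
fit$par                        # MLE estimates of (mu, sigma)
```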